Port Qwen2.5-VL Model #2574

Open
jaytiwarihub wants to merge 12 commits into keras-team:master from jaytiwarihub:feat/qwen2-vl

Conversation

@jaytiwarihub (Contributor) commented Feb 4, 2026

PR Title

[Model] Add Qwen2-VL Model Architecture and Preprocessing

PR Description

What does this PR do?
This PR implements the Qwen2-VL (Qwen2 Vision-Language) model architecture in Keras 3. Qwen2-VL is a state-of-the-art multimodal model that introduces "Naive Dynamic Resolution" support, allowing it to process images of arbitrary aspect ratios by converting them into dynamic grids.

Key Components Implemented:

  • Qwen2VLVisionEncoder: A 3D Vision Transformer backbone that supports 3D convolution patch embeddings and 3D Rotary Positional Embeddings (RoPE) to handle video and dynamic image inputs.
  • Qwen2VLImageConverter: A preprocessing layer that implements the "Smart Resizing" logic, resizing images to optimal grid sizes based on the patch size (14x14) to minimize padding.
  • Qwen2VLProjector: A lightweight MLP adapter that projects visual features into the LLM's embedding space.
  • Qwen2VLCausalLM: The end-to-end model class connecting the vision tower with the Qwen2 text backbone.
  • Qwen2VLCausalLMPreprocessor: The high-level preprocessor handling text tokenization and image tensor conversion.
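To make the "Smart Resizing" bullet concrete, here is a hedged sketch of how such resizing is typically computed. The `factor` of 28 (patch size 14 × a 2x merge step) and the pixel budgets are assumptions modeled on the Qwen2-VL reference implementation, not values read from this PR:

```python
import math

def smart_resize(height, width, factor=28,
                 min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    # Round each side to the nearest multiple of `factor` so the image
    # tiles cleanly into patches with no padding.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    # If the total pixel count falls outside the budget, rescale while
    # preserving the aspect ratio, then snap back to the grid.
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

For example, a 1080x1920 frame would be snapped to grid-aligned dimensions within the pixel budget rather than padded to a fixed square.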

Technical Details:

  • 3D RoPE: Implemented custom rotary embeddings that account for Time, Height, and Width dimensions relative to the grid structure.
  • Dynamic Resolution: The vision encoder accepts inputs with variable spatial dimensions (processed via the image converter), enabling the model to "scan" images in their native aspect ratios.
  • MHA Compatibility: The attention mechanism creates standard Query/Key/Value splits compatible with Keras MultiHeadAttention.
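As a rough illustration of the 3D RoPE bullet (a sketch of the indexing only, not the PR's actual code): each patch in a (time, height, width) grid gets a coordinate triple, and the rotary embedding rotates separate chunks of the head dimension by each coordinate.

```python
import numpy as np

def rope_3d_position_ids(grid_t, grid_h, grid_w):
    # Enumerate patches in row-major (T, H, W) order and record each
    # patch's (time, height, width) coordinate. A 3D RoPE would rotate
    # one slice of the head dim by each of the three coordinate streams.
    t = np.arange(grid_t).repeat(grid_h * grid_w)
    h = np.tile(np.arange(grid_h).repeat(grid_w), grid_t)
    w = np.tile(np.arange(grid_w), grid_t * grid_h)
    return np.stack([t, h, w], axis=0)  # shape (3, T*H*W)
```

A static image is simply the grid_t=1 case, which is how one encoder can serve both images and video.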

Tests:

  • Added unit tests for all components (backbone_test, projector_test, image_converter_test).
  • Added an integration test (integration_test.py) verifying the end-to-end flow from raw text/image input to preprocessed tensors.
  • Verified shape correctness for both 3D (static image) and 5D (video) input tensors.

Reference:

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @jaytiwarihub, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for the Qwen2.5-VL model, integrating vision processing capabilities into the existing Qwen language model. It includes the core components for encoding visual information, projecting it into the text embedding space, and fusing it with text embeddings for multimodal processing. The changes also include modifications to the preprocessor to handle text-only inputs gracefully.

Highlights

  • Qwen2.5-VL Architecture Implementation: This pull request introduces the initial skeleton for the Qwen2.5-VL architecture, including the vision encoder, projector, and backbone.
  • Vision Encoder (ViT Structure): The Qwen2VLVisionEncoder is defined, implementing a Vision Transformer (ViT) structure for processing image inputs.
  • Vision to Text Projection: The Qwen2VLProjector is defined to downsample vision features and project them into the text embedding space, facilitating fusion with text embeddings.
  • Backbone Wiring: The Qwen2VLBackbone connects the vision processing components to the existing Qwen text backbone, enabling multimodal processing.
  • Image Handling: The preprocessor is modified to handle None images, skipping vision layers when only text input is provided.

Changelog
  • keras_hub/src/models/gemma3/gemma3_causal_lm_preprocessor.py
    • Modified to handle None images by skipping vision layers when only text input is provided.
    • Removed unused responses parameter from generate_preprocess function.
    • Removed unnecessary calculations when images is None
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone.py
    • Added Qwen2VLBackbone class to wire vision components to the existing Qwen text backbone.
    • Implements the forward pass for processing images and text, fusing their embeddings, and passing them through the LLM.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_projector.py
    • Added Qwen2VLProjector class to downsample vision features and project them into the LLM's hidden size.
    • Includes logic for merging 2x2 neighboring patches into a single token.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder.py
    • Added Qwen2VLVisionEncoder class implementing a Vision Transformer (ViT) structure.
    • Includes a 3D convolution layer for handling video and images, transformer blocks, and a patch merger.
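The "merging 2x2 neighboring patches into a single token" logic in the projector can be sketched as a reshape that concatenates each 2x2 neighborhood's features before the MLP projection (an illustration under assumed row-major patch ordering, not the PR's exact code):

```python
import numpy as np

def merge_patches_2x2(features, grid_h, grid_w):
    # features: (grid_h * grid_w, dim) patch features in row-major order.
    # Group each 2x2 spatial neighborhood into one token by concatenating
    # the four patch vectors along the feature axis, quartering the
    # sequence length before projecting into the LLM's hidden size.
    dim = features.shape[-1]
    x = features.reshape(grid_h // 2, 2, grid_w // 2, 2, dim)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/2, W/2, 2, 2, dim)
    return x.reshape(-1, 4 * dim)   # (H*W/4, 4*dim)
```

The merged `4*dim` vectors would then pass through the projector MLP to reach the text embedding width.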
Activity
  • Initial implementation of Qwen2.5-VL architecture.
  • Definition of vision encoder, projector, and backbone components.
  • Modification of preprocessor to handle None images.

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request introduces the initial skeleton for the Qwen2.5-VL model, including the backbone, vision encoder, and projector components, and refactors the Gemma3 preprocessor. While a good start, several areas require attention to align with repository standards. Specifically, new backbones should utilize the Keras Functional API, docstrings and get_config methods are missing, and there are implementation issues in Qwen2VLBackbone and Qwen2VLVisionBlock. Additionally, a minor code duplication issue was found in gemma3_causal_lm_preprocessor.py.

combined_embeddings = keras.ops.concatenate([image_embeddings, text_embeddings], axis=1)

# Pass through the LLM
x = self.text_backbone.transformer_layers(combined_embeddings)
critical

self.text_backbone.transformer_layers is a list of layers, not a single callable layer. This will raise an error. You need to iterate through the layers in a loop.

Additionally, this approach of accessing internal layers of self.text_backbone breaks encapsulation and is brittle. It would be better to either reuse the text_backbone's call method or restructure the model. The padding_mask input is also missing from the call to the transformer layers.
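A minimal sketch of the fix the reviewer describes (the class and layer here are illustrative stand-ins, not the actual KerasHub API): since `transformer_layers` is a Python list, each layer must be applied in turn rather than calling the list itself.

```python
import keras
import numpy as np

# Illustrative stand-in for a text backbone whose `transformer_layers`
# attribute is a Python list of layers, as in KerasHub backbones.
class TinyTextBackbone:
    def __init__(self, num_layers=2, dim=8):
        self.transformer_layers = [
            keras.layers.Dense(dim, activation="relu")
            for _ in range(num_layers)
        ]

def run_decoder(backbone, combined_embeddings):
    # A list is not callable: iterate over the layers instead of
    # writing `backbone.transformer_layers(combined_embeddings)`.
    x = combined_embeddings
    for layer in backbone.transformer_layers:
        x = layer(x)  # real decoder blocks would also take padding_mask
    return x
```

As the reviewer notes, reusing the text backbone's own call method avoids this kind of brittle reach into internal layers entirely.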

# We will squeeze these back at the end.
batched = True

batched = True

medium

This batched = True assignment is a duplicate of the one on line 681 and can be removed.

@@ -0,0 +1,41 @@
import keras
from keras_hub.src.models.backbone import Backbone
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone

medium

There is a trailing whitespace at the end of this line.

Suggested change
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone


def call(self, x, grid_thw=None):
# x shape: (Batch, Time, Height, Width, Channels)
x = self.patch_embed(x)

medium

There is a trailing whitespace at the end of this line.

Suggested change
x = self.patch_embed(x)
x = self.patch_embed(x)

@jaytiwarihub jaytiwarihub marked this pull request as ready for review February 7, 2026 17:48
@jaytiwarihub (Author) commented:

Note to maintainers: Please update the PR title to [Model] Add Qwen2-VL Model Architecture and Preprocessing (I cannot see the edit button on my end)

@sachinprasadhs sachinprasadhs added the new model For PRs that contribute a new model to the Keras Hub registry. label Feb 9, 2026
@jaytiwarihub jaytiwarihub changed the title from [WIP] Port Qwen2.5-VL Model to Port Qwen2.5-VL Model Feb 13, 2026
This was referenced Feb 16, 2026
Comment on lines +15 to +19
from keras_hub.src.models.qwen2_vl.qwen2_vl_causal_lm import Qwen2VLCausalLM
from keras_hub.src.models.qwen2_vl.qwen2_vl_projector import Qwen2VLProjector
from keras_hub.src.models.qwen2_vl.qwen2_vl_vision_encoder import (
Qwen2VLVisionEncoder,
)
@samudraneel05 commented Feb 17, 2026

Suggested change
from keras_hub.src.models.qwen2_vl.qwen2_vl_causal_lm import Qwen2VLCausalLM
from keras_hub.src.models.qwen2_vl.qwen2_vl_projector import Qwen2VLProjector
from keras_hub.src.models.qwen2_vl.qwen2_vl_vision_encoder import (
Qwen2VLVisionEncoder,
)
from keras_hub.src.models.qwen2_vl.qwen2_vl_presets import backbone_presets
from keras_hub.src.utils.preset_utils import register_presets
register_presets(backbone_presets, Qwen2VLBackbone)

The other imports inside the `__init__` are unnecessary and go against repo standards.

"""Qwen2-VL Backbone model.

This backbone combines the Vision Encoder and the Text Backbone.
It follows the KerasHub Functional API pattern.


This is an odd line in the docstring; better to remove it and instead document the parameters or args.

Comment on lines +1 to +13
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


Suggested change
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Why is this here when it does not exist in any other model's backbone file?

Comment on lines +1 to +13
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


Suggested change
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

This is not present in any other model's `__init__` file; curious as to why you put it here?

@sachinprasadhs (Collaborator) commented:

@jaytiwarihub, as you have mentioned, the implementation is for Qwen2.5-VL, so could you please rename the directories accordingly?
Also, Transformers has separate directories and implementations for Qwen2.5-VL and Qwen2-VL.
So, @samudraneel05 can continue on Qwen2-VL, and @jaytiwarihub will work on Qwen2.5-VL.
We will also have a separate Qwen3-VL implementation.

All of this has been communicated in the issue threads below, and the respective contributors have been assigned to these issues as well.
#2172
#2323
#2570

Hope this clarifies all the confusion.
Thanks for showing interest in contributing the models.

@jaytiwarihub (Author) commented:

@samudraneel05 thanks for the help, I appreciate it. Those weird lines were just the result of me feeling overwhelmed; I'll take care of it.

@sachinprasadhs (Collaborator) left a comment

Thanks for the PR.
I just went through the PR high level, the code looks incomplete and also does not follow the Keras Hub design principles and guidelines.
Please refer to https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING_MODELS.md for details.

@jaytiwarihub (Author) commented:

@sachinprasadhs thank you for your kind review! I'm working on it.

Labels: new model (for PRs that contribute a new model to the Keras Hub registry.)